Predicting Mental Health Problems from Social Determinants of Health and Caregiving Activities of Caregivers of Persons with Dementia: A Machine Learning Approach
BMIN503/EPID600 Final Project
Author
Hannah Cho
1 Overview
In the United States, as of 2021, approximately 34 million informal caregivers were providing unpaid support to persons living with dementia (PLWD). As dementia progresses to its end-stage or terminal phase, individuals with the condition often become fully dependent on caregivers for assistance with most daily activities, such as bathing, repositioning, and other essential personal care needs. Notably, 80% of PLWD receive care from informal caregivers—primarily spouses, adult children, and close friends—within community settings rather than institutionalized environments. The project addressed public health issues related to dementia caregiving, which directly affect caregivers’ emotional problems, such as depression and anxiety, by predicting key caregiving activities and sociodemographic features of those at high risk.
This project was conducted after consultations with Dr. George Demiris, a recognized expert in the field of dementia care and caregiving, and Dr. Huang, an experienced biostatistician. Their contributions provided critical insights into both the substantive and methodological aspects of the research, ensuring a rigorous and well-informed approach to addressing the challenges faced by caregivers of PLWD.
2 Introduction
In the United States, as of 2021, 34 million informal caregivers were providing unpaid support to persons living with dementia (PLWD; Alzheimer’s Association, 2023; McCabe et al., 2016; Reinhard et al., 2023). As PLWD progress to the end-stage or terminal stage of dementia, they often become dependent on assistance for most daily activities, including bathing and changing position. Of all PLWD, 80% receive care from individuals—often spouses, adult children, and close friends—in the community (Alzheimer’s Association, 2023).
Dementia caregivers face profoundly complex and multifaceted challenges that extend beyond the physical and emotional demands of caregiving. These challenges encompass physical tasks, such as providing continuous care, assisting with activities of daily living, and managing the multifarious health complications associated with dementia. However, the toll of caregiving is not solely physical; it also places a considerable emotional burden on caregivers. Many experience elevated levels of stress, anxiety, depression, and isolation, as the exhaustive nature of caregiving leaves little opportunity for self-care or social engagement.
Caregivers for persons living with dementia (PLWD) are particularly vulnerable to anxiety and depression due to the progressive and unpredictable trajectory of the disease. This trajectory often varies significantly based on individuals’ pre-existing conditions and comorbidities, adding to caregivers’ uncertainty and emotional strain.
Beyond these personal and emotional burdens, caregivers frequently encounter substantial social and systemic barriers that hinder their ability to access necessary support. These barriers include financial strain, a lack of accessible and affordable respite care, limited awareness of available resources, and cultural or social stigmas surrounding caregiving. Furthermore, caregivers often experience isolation due to diminished social engagement, compounding the psychological toll.
This study aims to explore how specific caregiving activities, individual characteristics, and sociodemographic factors predict the likelihood of anxiety and depression in dementia caregivers. Additionally, it investigates how effectively machine learning models can forecast these mental health outcomes, offering a novel approach to identifying caregivers at greatest risk and informing targeted interventions.
Problem Statement: Dementia caregiving is a demanding role, often accompanied by significant psychological challenges, including anxiety and depression. However, the specific caregiving activities, caregiver characteristics, and sociodemographic factors that most strongly predict these mental health outcomes remain insufficiently understood. Furthermore, while traditional statistical approaches have provided valuable insights, the potential of machine learning models to accurately forecast anxiety and depression in dementia caregivers remains underexplored. This research aims to identify the key predictors of these mental health outcomes and evaluate the predictive accuracy of machine learning algorithms in supporting early identification and targeted interventions for at-risk caregivers.
Research Question: This research aims to identify the key predictors of mental health outcomes and evaluate the predictive accuracy of machine learning algorithms in supporting the early identification of at-risk caregivers.
3 Methods
Dataset: This study used data from the National Health and Aging Trends Study (NHATS) Round 11 and the National Study of Caregiving (NSOC) Round 4, which include data collected in 2021. The NHATS is a publicly accessible dataset that includes a nationally representative sample of adults aged 65 years and older who are Medicare beneficiaries in the United States of America. The NHATS began in 2011 and included 8,245 participants who have been followed up annually since then; the study goal is to encourage research to maximize health and enhance the quality of life of older adults. The NSOC is conducted alongside the NHATS; participants in the NSOC are caregivers for older adults included in the NHATS. Both the NHATS and the NSOC were funded by the National Institute on Aging (R01AG062477; U01AG032947). When used together, the NHATS and NSOC provide valuable information on dyads of older adults receiving care and their family caregivers.
Samples: Persons living with dementia: Probable dementia was identified based on one of the following criteria: a self-reported diagnosis of dementia or Alzheimer’s disease by a physician; a score of 2 or higher on the AD8 screening instrument administered to proxy respondents; or a score 1.5 standard deviations or more below the mean on a range of cognitive tests. Caregivers: Caregivers were identified from the NSOC and NHATS datasets. Since this project specifically aims to explore caregivers of persons with dementia living in the community, the sample was further filtered by dementia classification (demclas) and residency (r11dresid).
After retrieving NHATS Round 11 and NSOC Round 4, I selected the sample of interest (from the NHATS Round 11 variable r11demclas) and then merged the necessary datasets.
As the purpose of this study was to examine family caregivers who provide care to persons living with dementia at home, we selected those who provided care to persons living with dementia and lived at home using the Dementia Classification with Programming Statements provided by the NHATS.
library(dplyr)     #data manipulation
library(haven)     #read_dta() for Stata files
library(ggplot2)   #data visualization
library(gtsummary) #summary statistics

#Bring datasets
df1 <- read_dta("~/R HC/BMIN503_Final_Project/final final/NHATS_Round_11_SP_File_V2.dta") #dementia classification in this file
df2 <- read_dta("~/R HC/BMIN503_Final_Project/final final/NSOC_r11.dta")   #caregiver information 1
df3 <- read_dta("~/R HC/BMIN503_Final_Project/final final/NSOC_cross.dta") #caregiver information 2
df4 <- read_dta("~/R HC/BMIN503_Final_Project/final final/NHATS_Round_11_OP_File.dta") #older adults information
3.2 Choosing Probable and Possible Dementia Participants
#need to clean df1 first in order to classify dementia classes
#ENTER WHICH ROUND
sp1 <- df1 |> mutate(rnd = 11)

#EDIT ROUND NUMBER INSIDE THE QUOTES
#(THIS REMOVES THE PREFIXES ON NEEDED VARIABLES)
sp1 <- sp1 |>
  rename_all(~stringr::str_replace(., "^r11", "")) |>
  rename_all(~stringr::str_replace(., "^hc11", "")) |>
  rename_all(~stringr::str_replace(., "^is11", "")) |>
  rename_all(~stringr::str_replace(., "^cp11", "")) |>
  rename_all(~stringr::str_replace(., "^cg11", ""))

#ADD R1DAD8DEM AND SET TO -1 FOR ROUND 1 BECAUSE THERE IS NO PRIOR DIAGNOSIS IN R1
sp1 <- sp1 |> mutate(dad8dem = ifelse(rnd == 1, -1, dad8dem))

#SUBSET NEEDED VARIABLES
df <- sp1 |>
  dplyr::select(spid, rnd, dresid, resptype, disescn9,
                chgthink1, chgthink2, chgthink3, chgthink4,
                chgthink5, chgthink6, chgthink7, chgthink8,
                dad8dem, speaktosp,
                todaydat1, todaydat2, todaydat3, todaydat4, todaydat5,
                presidna1, presidna3, vpname1, vpname3,
                quesremem, dclkdraw, atdrwclck,
                dwrdimmrc, dwrdlstnm, dwrddlyrc)

#FIX A ROUND 2 CODING ERROR (NOT NEEDED FOR ROUND 11)
#df <- df |> mutate(dwrdimmrc = ifelse(dwrdimmrc == 10 & dwrddlyrc == -3 & rnd == 2, -3, dwrdimmrc))

#CREATE SELECTED ROUND DEMENTIA CLASSIFICATION VARIABLE
df <- df |>
  #SET MISSING (RESIDENTIAL CARE FQ ONLY) AND N.A. (NURSING HOME RESIDENTS, DECEASED)
  mutate(demclas = ifelse(dresid == 3 | dresid == 5 | dresid == 7, -9,
                   ifelse((dresid == 4 & rnd == 1) | dresid == 6 | dresid == 8, -1,
                   #CODE PROBABLE IF DEMENTIA DIAGNOSIS REPORTED BY SELF OR PROXY
                   ifelse((disescn9 == 1 | disescn9 == 7) & (resptype == 1 | resptype == 2), 1, NA))))

#CODE AD8 SCORE
#ASSIGN VALUES TO AD8 ITEMS IF PROXY AND DEMENTIA CLASS NOT ALREADY ASSIGNED BY REPORTED DIAGNOSIS
for (i in 1:8) {
  df[[paste("ad8_", i, sep = "")]] <- as.numeric(
    ifelse(df[[paste("chgthink", i, sep = "")]] == 2 &
             df$resptype == 2 & is.na(df$demclas), 0,   #PROXY REPORTS NO CHANGE
    ifelse((df[[paste("chgthink", i, sep = "")]] == 1 |
              df[[paste("chgthink", i, sep = "")]] == 3) &
             df$resptype == 2 & is.na(df$demclas), 1,   #PROXY REPORTS A CHANGE OR ALZ/DEMENTIA
    ifelse(df$resptype == 2 & is.na(df$demclas), NA, -1))))
}

#INITIALIZE MISSING-ITEM COUNTS
for (i in 1:8) {
  df[[paste("ad8miss_", i, sep = "")]] <- as.numeric(
    ifelse(is.na(df[[paste("ad8_", i, sep = "")]]), 1,
    ifelse((df[[paste("ad8_", i, sep = "")]] == 0 |
              df[[paste("ad8_", i, sep = "")]] == 1) &
             df$resptype == 2 & is.na(df$demclas), 0, -1)))
}
for (i in 1:8) {
  df[[paste("ad8_", i, sep = "")]] <- as.numeric(
    ifelse(is.na(df[[paste("ad8_", i, sep = "")]]) &
             is.na(df$demclas) & df$resptype == 2, 0,
           df[[paste("ad8_", i, sep = "")]]))
}

#COUNT AD8 ITEMS (ROUNDS 2+)
df <- df |>
  mutate(ad8_score = ifelse(resptype == 2 & is.na(demclas),
                            ad8_1 + ad8_2 + ad8_3 + ad8_4 + ad8_5 + ad8_6 + ad8_7 + ad8_8, -1)) |>
  #SET PREVIOUS ROUND DEMENTIA DIAGNOSIS BASED ON AD8 TO AD8_SCORE = 8
  mutate(ad8_score = ifelse(dad8dem == 1 & resptype == 2 & is.na(demclas), 8, ad8_score)) |>
  #SET PREVIOUS ROUND DEMENTIA DIAGNOSIS BASED ON AD8 TO AD8_SCORE = 8 FOR ROUNDS 4-9
  mutate(ad8_score = ifelse(resptype == 2 & dad8dem == -1 & chgthink1 == -1 &
                              (rnd >= 4 & rnd <= 9) & is.na(demclas), 8, ad8_score))

#COUNT MISSING AD8 ITEMS
df <- df |>
  mutate(ad8_miss = ifelse(resptype == 2 & is.na(demclas),
                           ad8miss_1 + ad8miss_2 + ad8miss_3 + ad8miss_4 +
                             ad8miss_5 + ad8miss_6 + ad8miss_7 + ad8miss_8, -1))

#CODE AD8 DEMENTIA CLASS
#IF SCORE >= 2 THEN MEETS AD8 CRITERIA; IF SCORE IS 0 OR 1 THEN DOES NOT MEET AD8 CRITERIA
df <- df |>
  mutate(ad8_dem = ifelse(ad8_score >= 2, 1,
                   ifelse(ad8_score == 0 | ad8_score == 1 | ad8_miss == 8, 2, NA)))

#UPDATE DEMENTIA CLASSIFICATION VARIABLE WITH AD8 CLASS
df <- df |>
  #PROBABLE DEMENTIA BASED ON AD8 SCORE
  mutate(demclas = ifelse(ad8_dem == 1 & is.na(demclas), 1,
                   #NO DIAGNOSIS, DOES NOT MEET AD8 CRITERION, AND PROXY SAYS CANNOT ASK SP COGNITIVE ITEMS
                   ifelse(ad8_dem == 2 & speaktosp == 2 & is.na(demclas), 3, demclas)))

#CODE DATE ITEMS AND COUNT
#CODE ONLY YES/NO RESPONSES: MISSING/N.A. CODES -1, -9 LEFT MISSING
#2: NO/DK OR -7: REFUSED RECODED TO 0: NO/DK/RF
for (i in 1:5) {
  df[[paste("date_item", i, sep = "")]] <- as.numeric(
    ifelse(df[[paste("todaydat", i, sep = "")]] == 1, 1,
    ifelse(df[[paste("todaydat", i, sep = "")]] == 2 |
             df[[paste("todaydat", i, sep = "")]] == -7, 0, NA)))
}

#COUNT CORRECT DATE ITEMS
df <- df |>
  mutate(date_item4 = ifelse(rnd == 4, date_item5, date_item4)) |>
  mutate(date_sum = date_item1 + date_item2 + date_item3 + date_item4) |>
  #PROXY SAYS CAN'T SPEAK TO SP
  mutate(date_sum = ifelse(speaktosp == 2 & is.na(date_sum), -2,
                    #PROXY SAYS CAN SPEAK TO SP BUT SP UNABLE TO ANSWER
                    ifelse((is.na(date_item1) | is.na(date_item2) |
                              is.na(date_item3) | is.na(date_item4)) & speaktosp == 1,
                           -3, date_sum))) |>
  #MISSING IF PROXY SAYS CAN'T SPEAK TO SP; 0 IF SP UNABLE TO ANSWER
  mutate(date_sumr = ifelse(date_sum == -2, NA,
                     ifelse(date_sum == -3, 0, date_sum)))

#PRESIDENT AND VICE PRESIDENT NAME ITEMS AND COUNT
#CODE ONLY YES/NO RESPONSES: MISSING/N.A. CODES -1, -9 LEFT MISSING
#2: NO/DK OR -7: REFUSED RECODED TO 0: NO/DK/RF
df <- df |>
  mutate(preslast  = ifelse(presidna1 == 1, 1, ifelse(presidna1 == 2 | presidna1 == -7, 0, NA))) |>
  mutate(presfirst = ifelse(presidna3 == 1, 1, ifelse(presidna3 == 2 | presidna3 == -7, 0, NA))) |>
  mutate(vplast    = ifelse(vpname1 == 1, 1, ifelse(vpname1 == 2 | vpname1 == -7, 0, NA))) |>
  mutate(vpfirst   = ifelse(vpname3 == 1, 1, ifelse(vpname3 == 2 | vpname3 == -7, 0, NA))) |>
  #COUNT CORRECT PRESIDENT/VP NAME ITEMS
  mutate(presvp = preslast + presfirst + vplast + vpfirst) |>
  #PROXY SAYS CAN'T SPEAK TO SP
  mutate(presvp = ifelse(speaktosp == 2 & is.na(presvp), -2,
                  #PROXY SAYS CAN SPEAK TO SP BUT SP UNABLE TO ANSWER
                  ifelse((is.na(preslast) | is.na(presfirst) |
                            is.na(vplast) | is.na(vpfirst)) &
                           speaktosp == 1 & is.na(presvp), -3, presvp))) |>
  mutate(presvpr = ifelse(presvp == -2, NA, ifelse(presvp == -3, 0, presvp))) |>
  #ORIENTATION DOMAIN: SUM OF DATE RECALL AND PRESIDENT/VP NAMING
  mutate(date_prvp = date_sumr + presvpr)

#EXECUTIVE FUNCTION DOMAIN: CLOCK DRAWING SCORE
#RECODE DCLKDRAW TO ALIGN WITH MISSING VALUES IN PREVIOUS ROUNDS (ROUND 10 ONLY)
df <- df |>
  mutate(dclkdraw = ifelse(speaktosp == 2 & dclkdraw == -9 & rnd == 10, -2,
                    ifelse(speaktosp == 1 & (quesremem == 2 | quesremem == -7 | quesremem == -8) &
                             dclkdraw == -9 & rnd == 10, -3,
                    ifelse(atdrwclck == 2 & dclkdraw == -9 & rnd == 10, -4,
                    ifelse(atdrwclck == 97 & dclkdraw == -9 & rnd == 10, -7, dclkdraw)))))

#RECODE DCLKDRAW TO ALIGN WITH MISSING VALUES IN PREVIOUS ROUNDS (ROUNDS 11 AND FORWARD ONLY)
df <- df |>
  mutate(dclkdraw = ifelse(speaktosp == 2 & dclkdraw == -9 & rnd >= 11, -2,
                    ifelse(speaktosp == 1 & (quesremem == 2 | quesremem == -7 | quesremem == -8) &
                             dclkdraw == -9 & rnd >= 11, -3, dclkdraw)))

df <- df |>
  mutate(clock_scorer = ifelse(dclkdraw == -3 | dclkdraw == -4 | dclkdraw == -7, 0,
                        #IMPUTE MEAN SCORE TO PERSONS MISSING A CLOCK
                        ifelse(dclkdraw == -9 & speaktosp == 1, 2,   #IF PROXY SAID CAN ASK SP
                        ifelse(dclkdraw == -9 & speaktosp == -1, 3,  #IF SELF-RESPONDENT
                        ifelse(dclkdraw == -2 | dclkdraw == -9, NA, dclkdraw)))))

#MEMORY DOMAIN: IMMEDIATE AND DELAYED WORD RECALL
df <- df |>
  mutate(irecall = ifelse(dwrdimmrc == -2 | dwrdimmrc == -1, NA,
                   ifelse(dwrdimmrc == -7 | dwrdimmrc == -3, 0, dwrdimmrc))) |>
  #round 5 only: set cases with missing word list and not previously assigned to missing
  mutate(irecall = ifelse(rnd == 5 & dwrddlyrc == -9, NA, irecall)) |>
  mutate(drecall = ifelse(dwrddlyrc == -2 | dwrddlyrc == -1, NA,
                   ifelse(dwrddlyrc == -7 | dwrddlyrc == -3, 0, dwrddlyrc))) |>
  mutate(drecall = ifelse(rnd == 5 & dwrddlyrc == -9, NA, drecall)) |>
  mutate(wordrecall0_20 = irecall + drecall)

#CREATE COGNITIVE DOMAINS FOR ALL ELIGIBLE
df <- df |> mutate(clock65  = ifelse(clock_scorer == 0 | clock_scorer == 1, 1,
                              ifelse(clock_scorer > 1 & clock_scorer < 6, 0, NA)))
df <- df |> mutate(word65   = ifelse(wordrecall0_20 >= 0 & wordrecall0_20 <= 3, 1,
                              ifelse(wordrecall0_20 > 3 & wordrecall0_20 <= 20, 0, NA)))
df <- df |> mutate(datena65 = ifelse(date_prvp >= 0 & date_prvp <= 3, 1,
                              ifelse(date_prvp > 3 & date_prvp <= 8, 0, NA)))

#CREATE COGNITIVE DOMAIN SCORE
df <- df |> mutate(domain65 = clock65 + word65 + datena65)

#SET CASES WITH MISSING WORD LIST AND NOT PREVIOUSLY ASSIGNED TO MISSING (ROUND 5 ONLY)
df <- df |> mutate(demclas = ifelse(rnd == 5 & dwrdlstnm == -9 & is.na(demclas), -9, demclas))

#UPDATE COGNITIVE CLASSIFICATION
df <- df |>
  mutate(demclas = ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) &
                            (domain65 == 2 | domain65 == 3), 1,  #PROBABLE DEMENTIA
                   ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) &
                            domain65 == 1, 2,                    #POSSIBLE DEMENTIA
                   ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) &
                            domain65 == 0, 3, demclas))))        #NO DEMENTIA

#KEEP VARIABLES AND SAVE DATA
df <- df |> dplyr::select(spid, rnd, demclas)

#NAME AND SAVE DEMENTIA DATA FILE (CHANGE # AFTER "r" TO THE ROUND OF INTEREST)
r11demclas <- df
save(r11demclas, file = "~/R HC/BMIN503_Final_Project/final final/NHATS_r11.dta")
Once the dementia class (demclas) is identified, it is saved and later merged with the caregiver datasets.
3.3 Merging Datasets
#merge datasets (md)
md1 <- left_join(df, df1, by = "spid")
md2 <- left_join(md1, df3, by = "spid")

#choose probable (1) and possible (2) dementia cases
#dementia1: living at home; dementia2: living at home or in residential care
dementia1 <- md2 |> filter(demclas %in% c("1", "2") & r11dresid %in% c("1"))
dementia2 <- md2 |> filter(demclas %in% c("1", "2") & r11dresid %in% c("1", "2"))
3.4 Preliminary Table Manipulation
Before creating the subset of the dataset for analysis, I will recode the variables. After reviewing the literature on dementia caregivers and social determinants of health, I selected the following variables.
Predictors: Caregiver-level factors are the caregivers’ age, race, gender, self-reported income, and highest level of education, and these were recoded accordingly. The education level of the caregivers was categorized as “Less than high school (0)”, “High School (1)”, and “College or above (2).” For economic status, the caregivers’ reported income from the previous year was used. This study included both informal and formal support as part of the caregivers’ social determinants of health. Informal support included having friends or family (a) to talk to about important life matters, (b) to help with daily activities, such as running errands, and (c) to assist with care provision.10 Formal support included (a) participation in a support group for caregivers, (b) access to respite services that allowed the caregiver to take time off, and (c) involvement in a training program that assisted the caregiver in providing care for the care recipient.10 We used these individual items as support questions, and each item was answered by indicating whether or not the caregiver received that type of support.
# Caregiver's age: chd11dage
# Race/ethnicity: recode crl11dcgracehisp into 'race_recode'
# 0 = White non-Hispanic, 1 = Black non-Hispanic, 2 = other, 3 = Hispanic
dementia1 <- dementia1 |>
  mutate(race_recode = case_when(
    crl11dcgracehisp == 1 ~ 0,                 # White, non-Hispanic
    crl11dcgracehisp == 2 ~ 1,                 # Black, non-Hispanic
    crl11dcgracehisp == 3 ~ 2,                 # Other
    crl11dcgracehisp == 4 ~ 3,                 # Hispanic
    crl11dcgracehisp %in% c(5, 6) ~ NA_real_,  # Missing or not applicable
    TRUE ~ NA_real_                            # Unhandled cases
  ))
table(dementia1$crl11dcgracehisp)
1 2 3 4 5 6
250 193 16 46 1 22
table(dementia1$race_recode)
0 1 2 3
250 193 16 46
# Gender: male as reference (0), female as 1
dementia1 <- dementia1 |>
  mutate(gender_recode = case_when(
    as.character(c11gender) == "1" ~ 0,  # Male
    as.character(c11gender) == "2" ~ 1,  # Female
    TRUE ~ NA_real_                      # Handle any other unexpected cases
  ))

# Education: recoding education levels into three categories
table(dementia1$chd11educ)
dementia1 <- dementia1 |>
  mutate(edu_recode = case_when(
    chd11educ %in% c(1, 2, 3, 4, 5) ~ 1,      # High school or below
    chd11educ %in% c(6, 7, 8)       ~ 2,      # Some college through college degree
    chd11educ == 9                  ~ 3,      # Beyond college
    chd11educ %in% c(-8, -7, -6)    ~ NA_real_,  # Missing or not applicable
    TRUE ~ NA_real_                           # Unhandled cases
  ))
table(dementia1$edu_recode)
1 2 3
209 234 69
# Marital status: recoding into binary (married vs. not married)
dementia1 <- dementia1 |>
  mutate(martstat_recode = case_when(
    chd11martstat == 1 ~ 0,                   # Married
    chd11martstat %in% 2:6 ~ 1,               # Not married (single, divorced, etc.)
    chd11martstat %in% c(-8, -6) ~ NA_real_,  # Missing or not applicable
    TRUE ~ NA_real_                           # Unhandled cases
  ))
table(dementia1$martstat_recode)
#recoding: 1 (every day), 2 (most days), 3 (some days) -> 1; 4 (rarely), 5 (never) -> 0
#caregiver's caregiving activities:
# cca11hwoftchs  #C11 CA1   HOW OFT HELP WITH CHORES
# cca11hwoftshp  #C11 CA2   HOW OFTEN SHOPPED FOR SP
# cca11hwoftpc   #C11 CA6   HOW OFT HELP PERS CARE
#                #C11 CA6B1 HELP CARE FOR TEETH (PREVIOUSLY CA11F)
# cca11hwofthom  #C11 CA7   HOW OFT HLP GTNG ARD HOME
# cca11hwoftdrv  #C11 CA9   HOW OFTEN DROVE SP
# cca11hwoftott  #C11 CA10  OFTN WENT ON OTH TRANSPR
#caregiver's features:
# che11enrgylmt  #Energy often limited
# cac11diffphy   #Caregiver physical difficulty helping
# cac11exhaustd  #Caregiver exhausted at night
# cac11toomuch   #Care more than can handle
# cac11uroutchg  #Care routine then changes
# cac11notime    #No time for self (-8, -7, -6)
# cac11diffemlv  #Caregiver emotional difficulty (-1, 1-5)
# cpp11hlpkptgo  #Kept from going out (-6, -1, 1, 2)
# che11health    #General health (-8, 1-5)
# che11sleepint  #Interrupted sleep (-8, -7, -6, 1-5)
# op11numhrsday  #Number of hours of help per day (-7, -1, 1-6)
# op11numdaysmn  #Number of days of help per month (-7, -1, 1-6)
#Sociodemographic features (persons living with dementia and caregiver)
table(dementia1$cac11notime)
# op11leveledu   #Caregiver education (NA -1, 1-5)
# cac11diffinc   #Caregiver financial difficulties (-8, -7, -6; binary 1, 2)
# ew11progneed1  #Persons living with dementia received food stamps (-8, -7; binary)
# ew11finhlpfam  #Persons living with dementia received financial help from family (-8, -7; binary)
# mc11havregdoc  #Persons living with dementia have a regular doctor (binary 1, 2)
# hc11hosptstay  #Persons living with dementia hospital stay in last 12 months (1, 2)
# hc11hosovrnht  #Persons living with dementia number of hospital stays (-7, -1, 1-6 times)
Outcomes: Caregivers’ anxiety and depressive symptoms were each measured with two questions. Anxiety was measured with the Generalized Anxiety Disorder-2 (GAD-2) scale, which consists of two items. Since the NHATS/NSOC provides GAD-2 data, this study used it to measure anxiety levels among caregivers. Each item on the scale is rated on a four-point Likert scale, ranging from 0 (not at all) to 3 (nearly every day), resulting in a total score between 0 and 6. Higher scores correspond to greater anxiety, with a total GAD-2 score of 3 or more indicating anxiety.
Caregivers’ depression was evaluated using the Patient Health Questionnaire-2 (PHQ-2) scale. Given that the NHATS/NSOC includes the PHQ-2, this study used it to measure caregiver depression. Each item is rated on a four-point Likert scale, ranging from 0 (not at all) to 3 (nearly every day), yielding a total score between 0 and 6, with higher scores indicating more severe depressive symptoms. A score of 3 has been identified as the optimal cutpoint when using the PHQ-2 to screen for depression: if the score is 3 or greater, major depressive disorder is likely.
# Sum the two GAD-2 items
dementia1$total_gad2 <- dementia1$che11fltnervs + dementia1$che11fltworry
# Recode the combined score using a cut-off of 3
dementia1$gad2_cg_cat <- ifelse(dementia1$total_gad2 < 3, 0, 1)
table(dementia1$gad2_cg_cat)
0 1
279 249
summary(dementia1$gad2_cg_cat) #1 ~ anxiety
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.0000 0.0000 0.4716 1.0000 1.0000 251
# Sum the two PHQ-2 items (che11fltltlin + che11fltdown)
dementia1$total_phq2 <- dementia1$che11fltltlin + dementia1$che11fltdown
# Recode the combined score using a cut-off of 3
dementia1$phq2_cg_cat <- ifelse(dementia1$total_phq2 < 3, 0, 1)
table(dementia1$phq2_cg_cat)
0 1
276 252
summary(dementia1$phq2_cg_cat) #1 ~ depression
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.0000 0.0000 0.4773 1.0000 1.0000 251
3.5 Data Analysis
For data analysis, we first conducted descriptive analyses, including means, standard deviations, ranges, and percentages, to summarize the dataset. To investigate how caregivers’ social strains and caregiver-level factors influence caregiver depression, we performed logistic regression analyses. Guided by the conceptual framework of this study, univariate logistic regression analyses were employed to identify caregivers’ social strains and caregiver-level factors significantly associated with caregiver anxiety and depression, controlling for care recipient-level factors. Variables with a p-value below 0.05 in the univariate analyses were included in the subsequent multivariate logistic regression model. The multivariate model was then constructed to determine which factors most strongly influenced caregiver anxiety and depression. All statistical analyses were conducted using R, with statistical significance set at a p-value of less than 0.05.
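The screening-then-modeling procedure described above can be sketched as follows. This is a minimal illustration on simulated data: the variable names mirror the NSOC items used elsewhere in this report, but the values are synthetic and the predictor set is only illustrative.

```r
# Univariate screening (p < 0.05) followed by a multivariate logistic model,
# shown on hypothetical toy data standing in for the recoded NSOC subset.
set.seed(1)
toy <- data.frame(
  gad2_cg_cat = rbinom(200, 1, 0.5),            # binary anxiety outcome
  cac11notime = sample(1:2, 200, replace = TRUE),
  che11health = sample(1:5, 200, replace = TRUE),
  chd11dage   = round(rnorm(200, 62, 10))
)
predictors <- setdiff(names(toy), "gad2_cg_cat")

# Step 1: univariate logistic regressions; collect the predictor p-values
uni_p <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, "gad2_cg_cat"), data = toy, family = binomial)
  coef(summary(fit))[2, "Pr(>|z|)"]
})
keep <- names(uni_p)[uni_p < 0.05]

# Step 2: multivariate model on the retained predictors (fall back to an
# intercept-only model if nothing passes the screen, as can happen here
# because the toy data are random noise)
form <- if (length(keep)) reformulate(keep, "gad2_cg_cat") else gad2_cg_cat ~ 1
multi_fit <- glm(form, data = toy, family = binomial)
summary(multi_fit)
```

On the real data, the same two steps would be run separately for the anxiety (gad2_cg_cat) and depression (phq2_cg_cat) outcomes.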
4.1
#choosing only one caregiver for each participant
final <- dementia1 |>
  group_by(spid) |>
  slice_head(n = 1) |>
  ungroup()
#total 563 caregivers

#creating a subset to work more effectively
library(dplyr)
selected_vars <- c(
  "spid", "demclas",
  "cca11hwoftchs", "cca11hwoftshp", "cca11hwoftpc",
  "cca11hwofthom", "cca11hwoftdrv", "cca11hwoftott",
  "che11enrgylmt", "cac11diffphy", "cac11exhaustd",
  "cac11toomuch", "cac11uroutchg", "cac11notime",
  "cac11diffemlv", "cpp11hlpkptgo", "che11health", "che11sleepint",
  "ew11progneed1", "ew11finhlpfam", "mc11havregdoc",
  "hc11hosptstay", "hc11hosovrnht", "cac11diffinc",
  "pa11hlkepfvst", "pa11hlkpfrclb", "pa11hlkpgoenj",
  "pa11hlkpfrwrk", "pa11hlkpfrvol", "pa11prcranoth",
  "race_recode", "gender_recode", "edu_recode", "chd11dage",
  "martstat_recode", "phq2_cg_cat", "gad2_cg_cat"
)
dementia_subset <- dementia1 |> select(all_of(selected_vars))
head(dementia_subset)
# A tibble: 6 × 37
spid demclas cca11hwoftchs cca11hwoftshp cca11hwoftpc cca11hwofthom
<dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
1 10000036 1 NA NA NA NA
2 10000041 2 1 [1 EVERY DAY] 1 [1 EVERY DAY] 1 [1 EVERY … 1 [1 EVERY …
3 10000041 2 5 [5 NEVER] 4 [4 RARELY] 5 [5 NEVER] 3 [3 SOME D…
4 10000051 1 1 [1 EVERY DAY] 3 [3 SOME DAYS] 2 [2 MOST D… 3 [3 SOME D…
5 10000064 1 2 [2 MOST DAYS] 5 [5 NEVER] 3 [3 SOME D… 3 [3 SOME D…
6 10000064 1 1 [1 EVERY DAY] 1 [1 EVERY DAY] 5 [5 NEVER] 3 [3 SOME D…
# ℹ 31 more variables: cca11hwoftdrv <dbl+lbl>, cca11hwoftott <dbl+lbl>,
# che11enrgylmt <dbl+lbl>, cac11diffphy <dbl+lbl>, cac11exhaustd <dbl+lbl>,
# cac11toomuch <dbl+lbl>, cac11uroutchg <dbl+lbl>, cac11notime <dbl+lbl>,
# cac11diffemlv <dbl+lbl>, cpp11hlpkptgo <dbl+lbl>, che11health <dbl+lbl>,
# che11sleepint <dbl+lbl>, ew11progneed1 <dbl+lbl>, ew11finhlpfam <dbl+lbl>,
# mc11havregdoc <dbl+lbl>, hc11hosptstay <dbl+lbl>, hc11hosovrnht <dbl+lbl>,
# cac11diffinc <dbl+lbl>, pa11hlkepfvst <dbl+lbl>, pa11hlkpfrclb <dbl+lbl>, …
# Select only numeric columns
numeric_dementia1 <- dementia_subset |> select(where(is.numeric))

# Compute the correlation matrix
cor_matrix <- cor(numeric_dementia1, method = "kendall", use = "pairwise.complete.obs")

# Filter correlations above a threshold
high_corr <- which(abs(cor_matrix) > 0.8 & abs(cor_matrix) < 1, arr.ind = TRUE)
high_corr_pairs <- data.frame(
  Var1 = rownames(cor_matrix)[high_corr[, 1]],
  Var2 = colnames(cor_matrix)[high_corr[, 2]],
  Correlation = cor_matrix[high_corr]
)
print(high_corr_pairs)
# Load pheatmap (install first if necessary)
library(pheatmap)
# Create the heatmap
pheatmap(cor_matrix, color = colorRampPalette(c("blue", "white", "red"))(50))
Since my outcomes are GAD-2 and PHQ-2, I am most focused on the variables clustered around them in the heatmap: che11enrgylmt, che11health, cac11diffemlv, hc11hosovrnht, martstat_recode, race_recode, mc11havregdoc, gender_recode, cca11hwoftott, ia11totinc, ew11finhlpfam, and ew11progneed1.
The correlation matrix (Kendall’s tau, computed on pairwise-complete observations) was presented as a heatmap to visually assess the data and evaluate the independence of the variables. The correlation coefficient quantifies the association between two variables, with values ranging from −1 to +1: a coefficient of 0 indicates no correlation, while negative and positive values represent negative and positive associations, respectively. Among highly correlated items (|correlation| > 0.8), I then selected the most representative variable to reduce conceptual redundancy. The negative reserved-code values in the dataset were not addressed in this step.
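One simple way to operationalize "keep the most representative variable" is to walk the correlated pairs and drop the later-listed member of each pair. The sketch below is illustrative only, using hypothetical columns x1-x3 rather than the project's numeric_dementia1 matrix; the 0.8 threshold matches the one used above.

```r
# Drop one variable from each highly correlated pair (|tau| > 0.8),
# keeping the first-listed variable as the representative.
set.seed(2)
x1  <- rnorm(100)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(100, sd = 0.05),  # near-duplicate of x1
                  x3 = rnorm(100))                   # unrelated
cm <- cor(toy, method = "kendall", use = "pairwise.complete.obs")

drop <- character(0)
vars <- colnames(cm)
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (j > i && abs(cm[i, j]) > 0.8 && !(vars[i] %in% drop)) {
      drop <- union(drop, vars[j])  # flag the later variable as redundant
    }
  }
}
reduced <- toy[, setdiff(vars, drop), drop = FALSE]
names(reduced)  # x2 is flagged as redundant with x1 and removed
```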
4.1.1 Imputation
The original dataset includes numerous negative reserved codes (e.g., refused, not applicable) and N/A values, and the sample size is small, necessitating preprocessing before feature selection. Because discarding incomplete cases would shrink the sample further, negative values were first recoded to N/A rather than dropped. These N/A values were then imputed according to the type of feature: for continuous variables, N/A values were replaced with the median of the respective column, while the most frequent category level was used to impute N/A values for categorical features.
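A minimal sketch of this imputation step, shown on a hypothetical two-column toy data frame (the real preprocessing applies the same logic across the NSOC subset):

```r
# Toy stand-in: NHATS-style negative codes and NAs in one numeric and one
# categorical column.
toy <- data.frame(
  hours = c(2, -7, 4, NA, 6),                       # continuous; -7 = reserved code
  supp  = factor(c("yes", "no", NA, "yes", "yes"))  # categorical
)

# 1. Recode negative reserved codes to NA (numeric columns only)
toy[] <- lapply(toy, function(x) {
  if (is.numeric(x)) replace(x, !is.na(x) & x < 0, NA) else x
})

# 2. Impute: median for numeric columns, most frequent level for factors
toy[] <- lapply(toy, function(x) {
  if (is.numeric(x)) {
    replace(x, is.na(x), median(x, na.rm = TRUE))
  } else {
    mode_lvl <- names(which.max(table(x)))  # modal category
    replace(x, is.na(x), mode_lvl)
  }
})
toy  # hours: 2, 4, 4, 4, 6; supp: NA filled with "yes"
```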
# Summary of 'demclas'
table(final$demclas) #1 probable dementia #2 possible dementia
1 2
331 232
# Adding labels for clarity
final$demclas_label <- factor(
  final$demclas,
  levels = c(1, 2),
  labels = c("Probable Dementia", "Possible Dementia")
)

# Plotting the distribution
ggplot(final, aes(x = demclas_label)) +
  geom_bar(color = "grey", fill = "grey", alpha = 0.7) +
  labs(
    title = "Distribution of Dementia Classifications",
    x = "Dementia Class",
    y = "Count"
  ) +
  theme_minimal()
4.2.1.1 Graph
# Dementia class analyses
# Labeling variables
final1 <- final_dementia |>
  mutate(
    race_recode = factor(race_recode, levels = c("0", "1", "2", "3"),
                         labels = c("White, non-Hispanic", "Black, non-Hispanic",
                                    "Other", "Hispanic")),
    gender_recode = factor(gender_recode, levels = c("0", "1"),
                           labels = c("Male", "Female")),
    edu_recode = factor(edu_recode, levels = c("1", "2", "3"),
                        labels = c("Below high school", "Some college",
                                   "College and beyond")),
    martstat_recode = factor(martstat_recode, levels = c("0", "1"),
                             labels = c("Married", "Not Married"))
  )

# 1. Bar plots for categorical variables (race, gender, education, marital status)
# Race
ggplot(final1, aes(x = race_recode, fill = factor(demclas))) +
  geom_bar() +
  labs(title = "Race Distribution by Dementia Class",
       x = "Race", y = "Count", fill = "Dementia Class") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal()
# Gender - count plot
ggplot(final1, aes(x = gender_recode, fill = factor(demclas))) +
  geom_bar() +
  labs(title = "Gender Distribution by Dementia Class",
       x = "Gender", y = "Count", fill = "Dementia Class") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal()
# Education - count plot
ggplot(final1, aes(x = edu_recode, fill = factor(demclas))) +
  geom_bar() +
  labs(title = "Education Distribution by Dementia Class",
       x = "Education", y = "Count", fill = "Dementia Class") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal()
# Marital Status - Count plot
ggplot(final1, aes(x = martstat_recode, fill = factor(demclas))) +
  geom_bar() +  # Default bar plot based on counts
  labs(title = "Marital Status Distribution by Dementia Class",
       x = "Marital Status", y = "Count") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal()
# 2. Boxplot for continuous variable (age)
# demclas is converted to a factor so it can serve as a discrete x axis
ggplot(final1, aes(x = factor(demclas), y = chd11dage, fill = factor(demclas))) +
  geom_boxplot() +
  labs(title = "Age Distribution by Dementia Class",
       x = "Dementia Class", y = "Age") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal()
This exploratory study employs multiple machine learning techniques, including correlation matrix analysis, generalized linear models (GLMs), and random forest (RF), to identify key predictors of caregiver depression and anxiety. This multipronged approach is essential given the diverse data types in this study. Because machine learning methods are inductive, they support hypothesis generation and allow systematic feature reduction by excluding variables deemed unimportant across multiple methods. This refines the feature set, enhancing the interpretability and predictive accuracy of the models.
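The correlation matrix step is not shown in this excerpt; below is a minimal sketch of how it could be run over the numeric predictors in final1. The corrplot package and the variable selection shown here are assumptions, not the author's exact code.

```r
library(dplyr)
library(corrplot)  # assumed visualization package; base cor() alone also works

# Pairwise correlations over numeric predictors only,
# handling missing values pair by pair
num_vars <- final1 |> select(where(is.numeric))
cor_mat  <- cor(num_vars, use = "pairwise.complete.obs")

# Highly correlated pairs are candidates for feature reduction
corrplot(cor_mat, method = "color", type = "upper", tl.cex = 0.7)
```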
### Anxiety: GLM workflow
anxiety_glm_wf <- workflow() |>
  add_model(lr_class_spec) |>
  add_formula(gad2_cg_cat ~ .)

# Fit the workflow to the test data
anxiety_glm_fit <- anxiety_glm_wf |>
  fit(data = anxiety_test)

# Generate predictions with probabilities
anxiety_glm_predicted <- predict(anxiety_glm_fit, new_data = anxiety_test, type = "prob")
anxiety_glm_predicted
# Combine into a single data frame
anxiety_glm_pred_values <- bind_cols(
  truth = anxiety_test$gad2_cg_cat,                                 # Actual values of the outcome variable
  predict(anxiety_glm_fit, new_data = anxiety_test),                # Predicted class labels
  predict(anxiety_glm_fit, new_data = anxiety_test, type = "prob")  # Predicted probabilities
)
print(anxiety_glm_pred_values)
# Prediction on the test data
anxiety.lr.pred.values.test <- bind_cols(
  truth = anxiety_test$gad2_cg_cat,
  predict(lr_class_fit, anxiety_test),
  predict(lr_class_fit, anxiety_test, type = "prob")
)
anxiety.lr.pred.values.test
### Depression: GLM workflow
depression_glm_wf <- workflow() |>
  add_model(lr_class_spec) |>
  add_formula(phq2_cg_cat ~ .)

# Fit the workflow to the test data
depression_glm_fit <- depression_glm_wf |>
  fit(data = depression_test)

# Generate predictions with probabilities
depression_glm_predicted <- predict(depression_glm_fit, new_data = depression_test, type = "prob")
depression_glm_predicted
# Combine into a single data frame
depression_glm_pred_values <- bind_cols(
  truth = depression_test$phq2_cg_cat,                                    # Actual values of the outcome variable
  predict(depression_glm_fit, new_data = depression_test),                # Predicted class labels
  predict(depression_glm_fit, new_data = depression_test, type = "prob")  # Predicted probabilities
)
print(depression_glm_pred_values)
# Prediction on the test data
depression.lr.pred.values.test <- bind_cols(
  truth = depression_test$phq2_cg_cat,
  predict(lr_class_fit, depression_test),
  predict(lr_class_fit, depression_test, type = "prob")
)
depression.lr.pred.values.test
I specified a random forest model with 1,000 trees and a minimum node size of 5, using the randomForest engine for classification and enabling variable importance. I trained the model on the anxiety_train dataset and visualized the top predictors based on Mean Decrease Gini.
For model validation, I performed 20-fold cross-validation, integrating the random forest model into a workflow. The model achieved a strong ROC-AUC score of 0.91, demonstrating excellent classification performance.
4.2.4.1 Anxiety
library(tune)
library(parsnip)
library(recipes)
library(rsample)
library(workflows)
library(vip)  # needed for the variable importance plot below

rf_spec <- rand_forest(trees = 1000, min_n = 5) |>
  set_engine("randomForest", importance = TRUE) |>
  set_mode("classification")

rf_fit <- rf_spec |>
  fit(gad2_cg_cat ~ ., data = anxiety_train)

## Top variables by importance
rf_fit |>
  extract_fit_engine() |>
  vip()
# Fit the random forest model on the full training data
anxiety_rf_fit <- rf_spec |>
  fit(gad2_cg_cat ~ ., data = anxiety_train)
anxiety_rf_fit
parsnip model object
Call:
randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, nodesize = min_rows(~5, x), importance = ~TRUE)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 3
OOB estimate of error rate: 16.89%
Confusion matrix:
0 1 class.error
0 295 31 0.09509202
1 45 79 0.36290323
# Testing
anxiety_rf_pred_values <- bind_cols(
  truth = anxiety_test$gad2_cg_cat,                                # Actual values of the outcome variable
  predict(anxiety_rf_fit, new_data = anxiety_test),                # Predicted class labels
  predict(anxiety_rf_fit, new_data = anxiety_test, type = "prob")  # Predicted probabilities
)
roc_auc(anxiety_rf_pred_values, truth, .pred_0)
I generated cross-validation predictions and calculated the ROC-AUC for each fold, then plotted the ROC curve for the cross-validation results. Afterward, I fitted the random forest model to the full training data (anxiety_train) and predicted class labels and probabilities on the test data, achieving a ROC-AUC of 0.8577. Finally, I plotted the ROC curve for the test set predictions.
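The resampling object rf_wf_fit_cv_anxiety used below is created outside this excerpt. A minimal sketch of how it might be built with tidymodels, reusing the rf_spec defined earlier; the fold count follows the 20-fold design described above, while the seed is an assumption:

```r
library(rsample)
library(workflows)
library(tune)

set.seed(123)  # assumed seed for reproducibility
anxiety_folds <- vfold_cv(anxiety_train, v = 20)

# Resample the random forest workflow across the 20 folds,
# saving per-fold predictions for later ROC analysis
rf_wf_fit_cv_anxiety <- workflow() |>
  add_model(rf_spec) |>
  add_formula(gad2_cg_cat ~ .) |>
  fit_resamples(resamples = anxiety_folds,
                control = control_resamples(save_pred = TRUE))
```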
# Collect metrics from the resampling results
rf_wf_fit_cv_anxiety_metrics <- collect_metrics(rf_wf_fit_cv_anxiety)
# roc_auc = 0.9213547

# Collect predictions for further analysis
rf_wf_fit_cv_anxiety_preds <- collect_predictions(rf_wf_fit_cv_anxiety)
The random forest model performed consistently across evaluations, with ROC AUC values of 0.917, 0.915, and 0.921, indicating that it effectively distinguishes between the classes. The ROC curve further confirmed this strong classification ability.
I created a random forest model with 1000 trees and a minimum node size of 5 to classify depression using the phq2_cg_cat variable. After fitting the model to the training data, I extracted the importance of the top variables, using the MeanDecreaseGini measure to identify the most influential features.
For evaluation, I performed 20-fold cross-validation on the training data, resulting in a ROC AUC of 0.8975, indicating good performance in classifying depression.
# Fit the random forest model on the full training data
depression_rf_fit <- rf_spec |>
  fit(phq2_cg_cat ~ ., data = depression_train)
depression_rf_fit
parsnip model object
Call:
randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, nodesize = min_rows(~5, x), importance = ~TRUE)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 3
OOB estimate of error rate: 18.44%
Confusion matrix:
0 1 class.error
0 302 31 0.09309309
1 52 65 0.44444444
# Testing
depression_rf_pred_values <- bind_cols(
  truth = depression_test$phq2_cg_cat,                                   # Actual values of the outcome variable
  predict(depression_rf_fit, new_data = depression_test),                # Predicted class labels
  predict(depression_rf_fit, new_data = depression_test, type = "prob")  # Predicted probabilities
)
roc_auc(depression_rf_pred_values, truth, .pred_0)
I collected the predictions from the 20-fold cross-validation of the random forest model and calculated the ROC AUC for each fold, confirming the model’s performance. The ROC curve was plotted to visualize the classification ability.
After fitting the model on the full training data, I tested it on the separate test set. The model achieved a ROC AUC of 0.9109 on the test data, demonstrating strong classification performance. The ROC curve for the test predictions further supported this finding.
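The resampling object rf_wf_fit_cv_depression referenced below is likewise created outside this excerpt. A minimal sketch under the same assumptions (20 folds, assumed seed), using the rf_spec defined earlier:

```r
library(rsample)
library(workflows)
library(tune)

set.seed(123)  # assumed seed for reproducibility
depression_folds <- vfold_cv(depression_train, v = 20)

# Resample the random forest workflow, saving per-fold predictions
rf_wf_fit_cv_depression <- workflow() |>
  add_model(rf_spec) |>
  add_formula(phq2_cg_cat ~ .) |>
  fit_resamples(resamples = depression_folds,
                control = control_resamples(save_pred = TRUE))
```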
# roc_auc = 0.89855

# Collect predictions for further analysis
rf_wf_fit_cv_depression_preds <- collect_predictions(rf_wf_fit_cv_depression)
The ROC AUC for the cross-validation results was 0.8995, indicating good model performance. After including the save_pred = TRUE control option, which stores predictions for each fold, I recalculated the ROC AUC, obtaining 0.8885.
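The per-fold ROC AUC described above can be recomputed from the saved predictions; a sketch assuming yardstick and the saved-prediction column names used elsewhere in this report:

```r
library(dplyr)
library(yardstick)

# One ROC AUC per cross-validation fold
# (the id column from collect_predictions() identifies each fold)
rf_wf_fit_cv_depression_preds |>
  group_by(id) |>
  roc_auc(truth = phq2_cg_cat, .pred_0)
```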
4.2.4.3 Graph
Anxiety
# ROC curve for GLM on the training set
anxiety_roc_glm_training <- anxiety_glm_pred_values |>
  roc_curve(truth, .pred_0)

# ROC curve for GLM on the test set
anxiety_roc_glm_cv <- anxiety.lr.pred.values.test |>
  roc_curve(truth, .pred_0)

# ROC curve for RF on the training set
anxiety_roc_rf_training <- anxiety_rf_pred_values |>
  roc_curve(truth, .pred_0)

# ROC curve for RF from cross-validation predictions
anxiety_roc_rf_cv <- rf_wf_fit_cv_anxiety_preds |>
  roc_curve(truth = gad2_cg_cat, .pred_0)

ggplot() +
  geom_path(data = anxiety_roc_glm_training,
            aes(x = 1 - specificity, y = sensitivity),
            color = "blue", linetype = "solid", linewidth = 1) +
  geom_path(data = anxiety_roc_glm_cv,
            aes(x = 1 - specificity, y = sensitivity),
            color = "green", linetype = "dashed", linewidth = 1) +
  geom_path(data = anxiety_roc_rf_training,
            aes(x = 1 - specificity, y = sensitivity),
            color = "red", linetype = "solid", linewidth = 1) +
  geom_path(data = anxiety_roc_rf_cv,
            aes(x = 1 - specificity, y = sensitivity),
            color = "purple", linetype = "dashed", linewidth = 1) +
  coord_equal() +
  theme_bw() +
  labs(title = "Anxiety ROC Curves for Logistic Regression and Random Forest",
       x = "1 - Specificity (False Positive Rate)",
       y = "Sensitivity (True Positive Rate)") +
  # Manual legend labels placed as annotations
  annotate("text", x = 0.7, y = 0.20, label = "Logistic Regression (Training)", color = "blue", size = 2) +
  annotate("text", x = 0.7, y = 0.15, label = "Logistic Regression (Test)", color = "green", size = 2) +
  annotate("text", x = 0.7, y = 0.10, label = "Random Forest (Training)", color = "red", size = 2) +
  annotate("text", x = 0.7, y = 0.05, label = "Random Forest (20-fold CV)", color = "purple", size = 2)
Limitations:
Despite the robust analysis and promising results, there are several limitations to consider. First, the sample size for both the training and testing sets may not fully capture the diversity of the target population, potentially limiting the generalizability of the findings. Second, while cross-validation helps to reduce overfitting, it does not account for all potential biases or variations within different subgroups of the data. Third, missing data, if not adequately handled, could have introduced bias or reduced the accuracy of model predictions. Furthermore, the models used are based on observed data and may not fully capture complex relationships or interactions that could emerge with a broader set of variables or external factors. Lastly, the reliance on specific features may limit the scope of the model’s performance in more dynamic or varied real-world scenarios.
4.3 Conclusion
In conclusion, we performed 20-fold cross-validation on both anxiety and depression datasets using GLM and random forest models. For both outcomes, we achieved high AUC values, indicating strong model performance. The variable importance analysis highlighted key features influencing predictions. Additionally, the ROC curve plots demonstrated good classification accuracy for both anxiety (AUC = 0.91) and depression (AUC = 0.91) models. These results suggest that the random forest models are effective in predicting anxiety and depression based on the available features.